safety checkers
TokenProber: Jailbreaking Text-to-image Models via Fine-grained Word Impact Analysis
Wang, Longtian, Xie, Xiaofei, Li, Tianlin, Zhi, Yuhan, Shen, Chao
Text-to-image (T2I) models have significantly advanced in producing high-quality images. However, such models can generate images containing not-safe-for-work (NSFW) content, such as pornography, violence, political content, and discrimination. To mitigate the risk of generating NSFW content, refusal mechanisms, i.e., safety checkers, have been developed to check for potential NSFW content. Adversarial prompting techniques have been developed to evaluate the robustness of these refusal mechanisms. The key challenge remains to subtly modify the prompt in a way that preserves its sensitive nature while bypassing the refusal mechanisms. In this paper, we introduce TokenProber, a method designed for sensitivity-aware differential testing, aimed at evaluating the robustness of the refusal mechanisms in T2I models by generating adversarial prompts. Our approach is based on the key observation that adversarial prompts often succeed by exploiting discrepancies in how T2I models and safety checkers interpret sensitive content. Thus, we conduct a fine-grained analysis of the impact of specific words within prompts, distinguishing between dirty words that are essential for NSFW content generation and discrepant words that highlight the different sensitivity assessments between T2I models and safety checkers. Through sensitivity-aware mutation, TokenProber generates adversarial prompts, striking a balance between maintaining NSFW content generation and evading detection. Our evaluation of TokenProber against 5 safety checkers on 3 popular T2I models, using 324 NSFW prompts, demonstrates its superior effectiveness in bypassing safety filters compared to existing methods (e.g., a 54%+ increase on average), highlighting TokenProber's ability to uncover robustness issues in existing refusal mechanisms. The source code, datasets, and experimental results are available in [1]. Warning: This paper contains model outputs that are offensive in nature.

Text-to-Image (T2I) models have gained widespread attention due to their excellent capability in synthesizing high-quality images. T2I models, such as Stable Diffusion [2] and DALL·E [3], process the textual descriptions provided by users, namely prompts, and output images that match the descriptions. Such models have been widely used to generate various types of images; for example, Lexica [4] contains more than five million images generated by Stable Diffusion.
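The word-level analysis this abstract describes can be pictured as a leave-one-word-out loop. Below is a minimal sketch in that spirit, not the paper's implementation; `t2i_sensitivity` and `checker_score` are assumed stand-ins for however one scores a prompt from the T2I model's and the safety checker's points of view.

```python
# Hypothetical sketch of fine-grained word impact analysis (not TokenProber's
# actual code). The two scoring callables are assumptions supplied by the user.
from typing import Callable, List, Tuple

def word_impacts(
    prompt: str,
    t2i_sensitivity: Callable[[str], float],  # the T2I model's view of the prompt
    checker_score: Callable[[str], float],    # the safety checker's view
) -> List[Tuple[str, float, float]]:
    """Leave-one-word-out: measure how removing each word shifts the two scores."""
    words = prompt.split()
    base_t2i = t2i_sensitivity(prompt)
    base_chk = checker_score(prompt)
    impacts = []
    for i, w in enumerate(words):
        ablated = " ".join(words[:i] + words[i + 1:])
        d_t2i = base_t2i - t2i_sensitivity(ablated)  # big drop: word drives NSFW generation
        d_chk = base_chk - checker_score(ablated)    # big drop: word triggers the checker
        impacts.append((w, d_t2i, d_chk))
    return impacts

def split_words(impacts, tau: float = 0.1):
    """'Dirty' words matter to both sides; 'discrepant' words trigger the checker
    without contributing much to generation, making them natural mutation targets."""
    dirty = [w for w, dt, dc in impacts if dt > tau and dc > tau]
    discrepant = [w for w, dt, dc in impacts if dc > tau and dt <= tau]
    return dirty, discrepant
```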
Antelope: Potent and Concealed Jailbreak Attack Strategy
Zhao, Xin, Chen, Xiaojun, Gao, Haoyu
Due to the remarkable generative potential of diffusion-based models, numerous studies have investigated jailbreak attacks targeting these frameworks. A particularly concerning threat within image models is the generation of Not-Safe-for-Work (NSFW) content. Despite the implementation of security filters, numerous efforts continue to explore ways to circumvent these safeguards. Current attack methodologies primarily encompass adversarial prompt engineering or concept obfuscation, yet they frequently suffer from slow search efficiency, conspicuous attack characteristics, and poor alignment with targets. To overcome these challenges, we propose Antelope, a more robust and covert jailbreak attack strategy designed to expose security vulnerabilities inherent in generative models. Specifically, Antelope leverages the confusion of sensitive concepts with similar ones, searches in the semantically adjacent space of these related concepts, and aligns them with the target imagery, thereby generating sensitive images that are consistent with the target and capable of evading detection. Besides, we successfully exploit the transferability of model-based attacks to penetrate online black-box services. Experimental evaluations demonstrate that Antelope outperforms existing baselines across multiple defensive mechanisms, underscoring its efficacy and versatility.
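As a rough illustration of the "semantically adjacent concept" idea (not Antelope's actual pipeline), one can rank candidate substitute concepts by their proximity to a sensitive target concept in CLIP text-embedding space; the model ID and the candidate list below are illustrative assumptions.

```python
# Sketch: rank benign-looking candidate concepts by CLIP text-embedding
# similarity to a target concept. Not Antelope's method; model ID is illustrative.
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed(texts):
    inputs = tokenizer(texts, padding=True, return_tensors="pt")
    embeds = text_model(**inputs).text_embeds
    return embeds / embeds.norm(dim=-1, keepdim=True)  # unit-normalize for cosine

def nearest_neighbors(target: str, candidates: list[str], k: int = 5):
    """Return the k candidate concepts closest to the target in embedding space."""
    target_emb = embed([target])
    cand_embs = embed(candidates)
    sims = (cand_embs @ target_emb.T).squeeze(-1)
    top = sims.topk(k=min(k, len(candidates)))
    return [(candidates[i], float(sims[i])) for i in top.indices.tolist()]
```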
Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models
Ma, Jiachen, Cao, Anda, Xiao, Zhiqing, Zhang, Jie, Ye, Chao, Zhao, Junbo
Text-to-Image (T2I) models have received widespread attention due to their remarkable generation capabilities. However, concerns have been raised about the ethical implications of these models generating Not Safe for Work (NSFW) images, because NSFW images may cause discomfort to people or be used for illegal purposes. To mitigate the generation of such images, T2I models deploy various types of safety checkers. However, these still cannot completely prevent the generation of NSFW images. In this paper, we propose the Jailbreak Prompt Attack (JPA), an automatic attack framework. We aim to find prompts that bypass safety checkers while preserving the semantics of the original images. Specifically, we search for such prompts by exploiting the robustness of the text space. Our evaluation demonstrates that JPA successfully bypasses both online services with closed-box safety checkers and offline safety-checker defenses to generate NSFW images.
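The generic search loop that such attacks share can be sketched as follows. This is a hedged illustration of the evade-while-preserving-semantics objective, not JPA's actual optimization; `mutate`, `passes_checker`, and `similarity` are assumed user-supplied callables (e.g., token substitution, a safety-filter query, and CLIP text similarity).

```python
# Hypothetical search loop: accept a mutated prompt only if it evades the
# safety checker while staying semantically close to the original.
from typing import Callable, Optional

def search_adversarial_prompt(
    prompt: str,
    mutate: Callable[[str], str],
    passes_checker: Callable[[str], bool],
    similarity: Callable[[str, str], float],
    sim_floor: float = 0.85,  # minimum semantic similarity to keep
    budget: int = 200,        # maximum number of mutation attempts
) -> Optional[str]:
    for _ in range(budget):
        candidate = mutate(prompt)
        if similarity(prompt, candidate) < sim_floor:
            continue  # semantics drifted too far from the original prompt
        if passes_checker(candidate):
            return candidate  # evades the filter while staying on-semantics
    return None  # no bypass found within the budget
```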
State-of-the-Art in Nudity Classification: A Comparative Analysis
Akyon, Fatih Cagatay, Temizel, Alptekin
This paper presents a comparative analysis of existing nudity classification techniques for classifying images based on the presence of nudity, with a focus on their application in content moderation. The evaluation focuses on CNN-based models, vision transformers, and popular open-source safety checkers from Stable Diffusion and the Large-scale Artificial Intelligence Open Network (LAION). The Stable Diffusion safety checker is designed to prevent unsafe image generation, while the LAION safety checker works by filtering out unwanted images from the training set to prevent diffusion models from being trained on inappropriate images. These safety checkers demonstrate the growing importance of developing effective and accurate image classification systems.
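For reference, the open-source Stable Diffusion safety checker compared in the paper can be run standalone via diffusers. The sketch below follows common Hugging Face usage; the model IDs are assumptions rather than the paper's exact evaluation setup.

```python
# Minimal standalone use of the Stable Diffusion safety checker on one image.
import numpy as np
from PIL import Image
from transformers import CLIPImageProcessor
from diffusers.pipelines.stable_diffusion.safety_checker import (
    StableDiffusionSafetyChecker,
)

checker = StableDiffusionSafetyChecker.from_pretrained(
    "CompVis/stable-diffusion-safety-checker"
)
extractor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

def is_nsfw(image: Image.Image) -> bool:
    """Return True if the Stable Diffusion safety checker flags the image."""
    clip_input = extractor(images=image, return_tensors="pt").pixel_values
    np_image = np.asarray(image.convert("RGB"), dtype=np.float32)[None] / 255.0
    _, has_nsfw = checker(images=np_image, clip_input=clip_input)
    return bool(has_nsfw[0])
```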